Meridian — Education
Schemas & Embeddings from First Principles
What they are, why they matter, and the consequences of choosing wrong.
Part 1
What Is a Schema?

A schema is a contract. It says: every piece of knowledge stored in this system will have these exact fields, in these exact formats, every single time.

Think of it like a form. If you go to a hospital, every patient record has the same fields: name, date of birth, blood type, allergies, medications. They don't let one doctor scribble "Bob, he's 40ish, allergic to something" while another fills out a fully structured record. The form IS the schema. It enforces consistency.

Why does this matter for AI?

Your sovereign AI stores thousands of pieces of knowledge. Principles, tactics, lessons learned, frameworks. If each one is stored differently — some with a confidence score, some without, some with a source name, some with a hex ID, some with mechanism explanations, some without — you can never reliably search, compare, or transfer knowledge between systems.

The schema is the interoperability protocol. If two Meridian builds use the same schema, a codex pack created by one can be installed into the other without any translation. If they use different schemas, every transfer requires a conversion step — and every conversion step is a place where data gets lost or corrupted.

A concrete example

Say you extract a principle from a marketing course:

"Lead with outcomes, not mechanisms, when selling to skeptical men."

Stored as NODE_SCHEMA:
  id:               "a7f3b2c1-..."          (unique forever)
  text:             "Lead with outcomes..."  (the actual principle)
  title:            "Outcomes before mechanisms"
  node_type:        "principle"
  source_id:        "Anatomy of Ads 2.0"    (where it came from)
  confidence_score: 0.92                    (how validated it is)
  tags:             "cold_traffic,identity,masculine"
  mechanism:        "Skeptical men evaluate outcome identity before caring about how-to"
  situation:        "Cold traffic ads for identity-based offers"
  when_not:         "Warm retargeting where credibility is established"

Every single principle in the system has these same fields. You can search by confidence. You can filter by tags. You can retrieve by situation. You can compare mechanisms. You can track where it came from. The schema makes the knowledge machine-readable, not just human-readable.

What happens without a schema

Rob's early system stored knowledge as "holons" — semi-structured blobs with varying fields. Some had sources, some didn't. Some had categories, some were just raw text. When you want to ask "show me all principles about cold traffic with confidence above 0.8" — you can't, because some holons don't have a confidence field, and some don't have category tags.

A schema solves this by requiring every field to exist on every record, even if it's empty. You always CAN query confidence, even if some nodes are at the default 0.75 because they haven't been validated yet.
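As a sketch of that rule: every record carries every field, with explicit defaults, so a query over confidence or mechanism never hits a missing key. The field names follow the NODE_SCHEMA description in this document (the vector field is omitted for brevity); the defaults are illustrative assumptions, not the production code.

```python
from dataclasses import dataclass

@dataclass
class Node:
    # Hypothetical sketch of NODE_SCHEMA defaults (vector field omitted).
    id: str
    text: str
    title: str = ""
    node_type: str = "principle"
    source_id: str = ""
    framework_id: str = ""
    confidence_score: float = 0.75   # default until validated in the real world
    tags: str = ""
    mechanism: str = ""
    situation: str = ""
    when_not: str = ""
    collection: str = "principles"
    date_added: str = ""

nodes = [
    Node(id="a7f3b2c1", text="Lead with outcomes...", confidence_score=0.92,
         tags="cold_traffic,identity,masculine"),
    Node(id="b8e4c3d2", text="Unvalidated note"),   # gets the 0.75 default
]

# The query that fails on schema-less "holons" works on every record here:
high_conf = [n for n in nodes if n.confidence_score > 0.8]
```

Because the field exists on every record, the filter never needs a "does this node have a confidence field?" branch.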


Part 2
The Schemas That Exist Right Now

Your system (VOHU MANAH) has 4 schemas because it evolved over time. Each was created for a different purpose.

Schema A: NODE_SCHEMA — The Main One

14 fields. Used by 7 collections. This is the rich, fully structured schema for actual knowledge — principles, tactics, examples, book excerpts. Every field is intentional:

  Field              Why it exists
  id                 Unique forever. Never changes. Lets you reference a specific principle across systems.
  vector             The embedding (we'll explain this in Part 3). Lets you search by meaning, not keywords.
  text               The actual content. What gets embedded and what humans read.
  title              Short label for display. "Outcomes before mechanisms" vs the full text.
  node_type          What kind of knowledge this is. A principle (general truth), a tactic (specific action), a concept (abstract idea), an example (situated instance).
  source_id          Where it came from. "Anatomy of Ads 2.0" — traceable back to the original material.
  framework_id       Which cluster of related principles this belongs to. E.g. the "cold_traffic" framework.
  confidence_score   0.0 to 1.0. How validated this principle is. Starts at extraction quality. Rises when the principle works in the real world. Falls when it doesn't.
  tags               Searchable labels. "copywriting,cold_traffic,identity" — comma-separated.
  mechanism          HOW/WHY this works. The causal explanation. Not just "do this" but "this works because..."
  situation          WHEN to apply this. The context where this principle is valid.
  when_not           WHEN NOT to apply this. Just as important — prevents misapplication.
  collection         Which collection it lives in (principles, tactics, etc.).
  date_added         When it was ingested. For tracking recency.

Schema B: LEGACY_SCHEMA — The Simple One

7 fields. Used by 4 collections. Created earlier, for simpler document storage (positioning docs, reference sites). Doesn't have mechanism, situation, when_not, framework_id, or even an id field. It's basically: text + source + confidence + tags.

Problem: Legacy collections can't participate in codex exchange because they're missing the fields that make knowledge useful (mechanism, when_not, situation). A codex buyer can't use a principle that says "do this" without knowing when to do it and when NOT to.

Schema C: CONV_SCHEMA — Conversations

9 fields. Stores full chat conversations as JSON blobs. Completely different purpose — this is session history, not knowledge. Not part of codex exchange.

Schema D: EVERGREEN_SCHEMA — Synthesis Pages

16 fields. Stores synthesized long-form content (trunk, branches, leaves, threads). This is the OUTPUT of the synthesis pipeline, not atomic knowledge. Not part of codex exchange as-is.

Rob's System (GHOSTNET)

Rob uses a different storage format entirely. His 16,717 holons are stored in LanceDB but with a different schema — less structured than NODE_SCHEMA, more like LEGACY. His holons have: text, vector (768-dim — different embedding model), and varying metadata. No standardized mechanism/situation/when_not fields.

This is why SPEC-001 matters: For Meridian to work — for codex packs to transfer between builds, for the collective to synthesize across nodes — everyone must use the same schema. NODE_SCHEMA is the candidate. It's the richest, most validated (6,797 nodes in production), and already enforced with hard validation (wrong function → ValueError).

Part 3
What Is an Embedding?

This is the most important concept to understand. Everything else flows from it.

The problem: computers can't understand meaning

A computer sees "the dog sat on the mat" and "the canine rested on the rug" as completely different strings. Different characters, different lengths. To a computer doing string comparison, these have zero similarity.

But to a human, they mean the same thing.

An embedding is a way to convert meaning into numbers. Specifically, into a list of numbers (a "vector") where similar meanings produce similar numbers.

How it works (simplified)

An embedding model is a neural network that has been trained on billions of text examples. It learned that "dog" and "canine" appear in similar contexts, so they should map to similar numbers. It learned that "the dog sat on the mat" and "investment banking regulations" appear in completely different contexts, so they should map to very different numbers.

When you feed text into an embedding model, it outputs a list of numbers. Like this:

"Lead with outcomes, not mechanisms"  →  [0.23, -0.15, 0.87, 0.02, -0.41, ... ] (1024 numbers)
"Show results before explaining how"  →  [0.21, -0.14, 0.85, 0.03, -0.39, ... ] (1024 numbers)
"How to change a car tire"            →  [-0.67, 0.33, -0.12, 0.55, 0.08, ... ] (1024 numbers)

The first two are about the same concept (outcome-first marketing). Their numbers are almost identical. The third is about something completely different. Its numbers are completely different.
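You can check this with the truncated five-number prefixes above. Cosine similarity, the standard way to compare embedding vectors, already separates them; this is a toy calculation on five numbers, not the real 1024-dimension math:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

outcomes = [0.23, -0.15, 0.87, 0.02, -0.41]   # "Lead with outcomes..."
results  = [0.21, -0.14, 0.85, 0.03, -0.39]   # "Show results..."
car_tire = [-0.67, 0.33, -0.12, 0.55, 0.08]   # "How to change a car tire"

print(cosine(outcomes, results))   # near 1.0: almost identical meaning
print(cosine(outcomes, car_tire))  # negative: unrelated meaning
```

A score near 1.0 means "same meaning", near 0 means "unrelated", and negative means "pointing in opposite semantic directions".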

Why 1024 numbers?

This is the dimension of the embedding. More dimensions = more nuance. Think of it like describing a color: three numbers (red, green, blue) can name any hue, but more numbers could also capture brightness, saturation, and sheen. More numbers mean finer distinctions between things that are almost alike.

768 dimensions (Rob's model) vs 1024 dimensions (your model) means your model captures slightly more nuance. Whether that matters depends on the data.

How search works with embeddings

When you ask "how do I sell to skeptical men?", the system:

  1. Embeds your question into 1024 numbers
  2. Compares those numbers to every stored principle's 1024 numbers
  3. Returns the principles whose numbers are most similar to your question's numbers

This is semantic search — search by meaning, not keywords. You don't need to use the exact words that are in the stored principle. "How do I sell to skeptical men?" finds "Lead with outcomes, not mechanisms" because the embeddings capture the semantic relationship.
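The three steps amount to a nearest-neighbour scan over the stored vectors. A minimal sketch with the embedding model stubbed out; the vectors and the embed() stub here are made up for illustration, not output from any real model:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Stub standing in for the real embedding model (hypothetical 3-dim vectors).
def embed(text):
    fake = {
        "how do I sell to skeptical men?":    [0.22, -0.14, 0.86],
        "Lead with outcomes, not mechanisms": [0.23, -0.15, 0.87],
        "How to change a car tire":           [-0.67, 0.33, -0.12],
    }
    return fake[text]

stored = ["Lead with outcomes, not mechanisms", "How to change a car tire"]

def search(query, k=1):
    q = embed(query)                                        # 1. embed the question
    scored = [(cosine(q, embed(t)), t) for t in stored]     # 2. compare to every node
    return [t for _, t in sorted(scored, reverse=True)[:k]] # 3. return most similar

print(search("how do I sell to skeptical men?"))
```

No keyword overlaps between the query and the winning principle; the ranking comes entirely from vector similarity. A real system would precompute and index the stored vectors rather than re-embedding them per query.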

This is why the embedding model choice is critical. The quality of search, retrieval, codex integration, and collective synthesis all depend on the embedding model understanding the nuances of your domain. A bad embedding model will return irrelevant results. A good one will surface exactly what you need.

Part 4
Why Q and Rob's Systems Can't Merge Right Now

Q's system uses BGE-M3 — produces 1024 numbers per text.
Rob's system uses nomic-embed-text — produces 768 numbers per text.

These are not compatible. You cannot compare a list of 1024 numbers to a list of 768 numbers. It's like trying to compare a 3D object to a 2D photograph of it — they represent the same thing but in different dimensional spaces. The math doesn't work.
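The incompatibility is mechanical. Comparing two vectors pairs them element by element, so any honest implementation has to reject mismatched lengths; a sketch:

```python
def cosine(a, b):
    # Element-wise math needs equal dimensions; 768 vs 1024 has no valid pairing.
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

bge_vector   = [0.1] * 1024   # BGE-M3 output shape
nomic_vector = [0.1] * 768    # nomic-embed-text output shape

try:
    cosine(bge_vector, nomic_vector)
except ValueError as e:
    print(e)   # dimension mismatch: 1024 vs 768
```

(Python's zip() would silently truncate to the shorter vector, which is worse than an error: it returns a number that means nothing.)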

This means no shared search, no codex transfer, and no collective synthesis between the two systems until one of them re-embeds everything.

This is the single most important infrastructure decision Meridian must make. Every build, every codex, every collective emission must use the same embedding model. Once you choose and clients start building, switching is astronomically expensive — you have to re-embed every single node across every single build.

Part 5
The Embedding Models Available

There are hundreds of embedding models. For Meridian, only a handful are realistic, because a candidate must run locally on CPU (sovereignty), be open source (no API dependency), deliver high search quality (retrieval must be accurate), and be proven at scale.

  BAAI/bge-m3
    Dimensions: 1024 | Size: 1.3 GB | Who: Beijing Academy of AI
    Quality (MTEB): Very high (top 5 on MTEB retrieval) | Speed (CPU): ~0.5s per text
    Notes: Q's current model. Multilingual. Supports dense + sparse + multi-vector. The most versatile option.

  nomic-embed-text
    Dimensions: 768 | Size: 274 MB | Who: Nomic AI
    Quality (MTEB): Good (comparable to OpenAI ada-002) | Speed (CPU): ~0.2s per text
    Notes: Rob's current model. Smaller, faster. Open source. Less nuanced than BGE-M3.

  BAAI/bge-large-en-v1.5
    Dimensions: 1024 | Size: 1.2 GB | Who: Beijing Academy of AI
    Quality (MTEB): High | Speed (CPU): ~0.4s per text
    Notes: English-only predecessor to BGE-M3. Slightly worse quality. Same dimensions.

  sentence-transformers/all-MiniLM-L6-v2
    Dimensions: 384 | Size: 80 MB | Who: Sentence Transformers
    Quality (MTEB): Medium | Speed (CPU): ~0.05s per text
    Notes: Very fast, very small, but 384-dim means less nuance. Fine for simple search, not enough for Meridian's knowledge density.

  Cohere embed-v3
    Dimensions: 1024 | Size: API only | Who: Cohere
    Quality (MTEB): Very high | Speed: Fast (API)
    Notes: Top quality but requires API — breaks sovereignty. Not viable for air-gapped clients.

  OpenAI text-embedding-3-large
    Dimensions: 3072 | Size: API only | Who: OpenAI
    Quality (MTEB): Highest | Speed: Fast (API)
    Notes: Best quality available but API-only + closed source. Non-starter for sovereignty. Also 3072-dim = 3x storage cost.

  Snowflake/arctic-embed-l
    Dimensions: 1024 | Size: 1.1 GB | Who: Snowflake
    Quality (MTEB): High | Speed (CPU): ~0.4s per text
    Notes: Strong retrieval performance. Open source. 1024-dim. Worth benchmarking against BGE-M3.

  Alibaba/gte-Qwen2-7B-instruct
    Dimensions: 3584 | Size: 14 GB | Who: Alibaba
    Quality (MTEB): Near-best | Speed (CPU): Very slow
    Notes: 7B parameter model — runs as a full LLM. Highest quality local option but requires GPU and massive resources. Not practical for client builds.

Part 6
The Tradeoffs — Pros, Cons, Consequences

BGE-M3 (Q's choice) — 1024-dim, 1.3 GB

Pros:
  • Top-tier retrieval quality on MTEB benchmarks
  • Multilingual — works in 100+ languages (Will speaks 5)
  • Supports dense, sparse, AND multi-vector search
  • 1024-dim captures fine-grained semantic nuance
  • Already validated with 6,797 nodes in production
  • Open source (MIT license), runs on CPU
  • Active development by BAAI

Cons:
  • 1.3 GB model — takes ~30s to load on first use
  • ~0.5s per embedding on CPU (fine for query, slow for bulk ingestion)
  • 1024-dim = more storage per node (4 KB per vector vs 3 KB for 768-dim)
  • Not the fastest option available

nomic-embed-text (Rob's choice) — 768-dim, 274 MB

Pros:
  • 5x smaller (274 MB vs 1.3 GB) — loads faster, less RAM
  • 2.5x faster per embedding (~0.2s vs ~0.5s)
  • Good quality (comparable to OpenAI ada-002)
  • Open source
  • Native Ollama support (Rob's stack)
  • 768-dim = less storage per node

Cons:
  • Lower semantic resolution (768 vs 1024 dimensions)
  • English-only — degraded quality for non-English text
  • Worse on MTEB retrieval benchmarks than BGE-M3
  • 768-dim is less standard — most modern models are moving to 1024+
  • No sparse or multi-vector support

Consequences of choosing wrong

If you pick a model and later need to switch:

Every single node — across every single build, every codex pack, every collective emission — must be re-embedded. For Q's current system, that's 6,797 nodes × 0.5s = ~1 hour. For a mature collective with 33 nodes at 10K principles each? 330,000 nodes × 0.5s = 46 hours of CPU time. Per build.
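The arithmetic above, spelled out (0.5s per embedding is the CPU estimate for BGE-M3 from Part 5):

```python
def reembed_hours(node_count, seconds_per_node=0.5):
    # Total CPU time to regenerate every vector, in hours.
    return node_count * seconds_per_node / 3600

print(round(reembed_hours(6_797), 1))     # Q's system today: about an hour
print(round(reembed_hours(330_000), 1))   # 33 nodes x 10K principles each
```

The cost scales linearly with node count, so every month of delay on the decision makes an eventual switch strictly more expensive.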

You cannot do a partial migration. Mixed embeddings are incompatible. It's all or nothing.

This is why you choose once and choose right.

Part 7
The Recommendation

BGE-M3 (1024-dim) is the right choice for Meridian

Here's why, factor by factor:

  Factor                 BGE-M3 wins?  Why
  Quality                Yes           Higher MTEB scores. Better retrieval means better agent responses, better codex integration, better synthesis.
  Multilingual           Yes           Will speaks French, Spanish, Russian, Italian. Clients may have knowledge in multiple languages. nomic is English-only.
  Scalability            Yes           1024-dim is becoming the industry standard. Future models will likely output 1024+. Starting at 768 means migrating later.
  Production validation  Yes           6,797 nodes, 27,338 edges, 60 evergreen frameworks already proven on BGE-M3. We know it works.
  Speed                  No            nomic is 2.5x faster. But 0.5s vs 0.2s per query is imperceptible to a human. Only matters for bulk ingestion.
  Size                   No            1.3 GB vs 274 MB. Matters on a Raspberry Pi. Doesn't matter on a machine with 64 GB RAM.
  Storage                No            4 KB vs 3 KB per vector. At 10,000 nodes: 40 MB vs 30 MB. Negligible.

The speed and size advantages of nomic are real but irrelevant at Meridian's scale. The quality and multilingual advantages of BGE-M3 are decisive.

The migration path for Rob: Re-embed all 16,717 holons with BGE-M3. On his Mac hardware, this takes ~2-3 hours as a batch job. Run it once. Done. His holons keep all their content — only the vector field changes. Everything else (text, metadata, structure) is untouched.
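Because the text is preserved, the migration reduces to one loop over the holons. A sketch with the model call stubbed out; the function and field names here are assumptions for illustration, not Rob's actual code:

```python
def reembed_all(holons, embed_fn):
    # Regenerate every vector from the permanent text field.
    # All other fields (text, metadata, structure) are untouched.
    for h in holons:
        h["vector"] = embed_fn(h["text"])
    return holons

# Stub standing in for BGE-M3 (the real model returns 1024 floats).
def fake_bge_m3(text):
    return [0.0] * 1024

holons = [
    {"text": "Lead with outcomes...", "vector": [0.0] * 768, "tags": "cold_traffic"},
]
migrated = reembed_all(holons, fake_bge_m3)
print(len(migrated[0]["vector"]))   # 1024
```

In practice this would batch the texts through the model and write the results back to LanceDB, but the shape of the job is exactly this loop.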

What about future models?

New embedding models come out every few months. If a significantly better model appears in 2027, can we switch?

Theoretically yes, practically it's expensive. The cost is re-embedding everything. For the founding three, manageable (a few hours). For 33 nodes? A weekend project. For 100+? A major migration event.

The mitigation: the schema stores the raw text alongside the vector. You always have the original text. Re-embedding means reading every text field and running it through the new model. Nothing is lost — it's just compute time.

This is why storing the full text (not just the embedding) in NODE_SCHEMA is critical. The text is permanent. The embedding is a function of the text + the model. If the model changes, you regenerate. If the text is gone, you're dead.


Part 8
Rob's Current System in Detail

Rob's GHOSTNET uses a different stack at every level:

  Component        Rob (GHOSTNET)                           Q (VOHU MANAH)                       Meridian Base Model
  Embedding model  nomic-embed-text (768-dim)               BGE-M3 (1024-dim)                    BGE-M3 (1024-dim)
  Embedding via    Ollama API                               sentence-transformers (Python)       sentence-transformers (Python)
  Storage          LanceDB (memories.lance)                 LanceDB (knowledge.db/)              LanceDB (knowledge.db/)
  Schema           Semi-structured (varying fields)         NODE_SCHEMA (14 fields, enforced)    NODE_SCHEMA v3 (14+ fields, enforced)
  Collections      4 (memories, dreams, synthesis, errors)  13 (7 node + 4 legacy + 2 custom)    TBD — minimum: knowledge + errors + dreams
  Graph            None (flat holon structure)              SQLite kg_edges (27,338 edges)       SQLite kg_edges
  Interface        AnythingLLM workspace                    Telegram + 4 Dash apps + Copilot     TBD — likely Open WebUI + RAG plugin

What Rob needs to change for Meridian compatibility

  1. Re-embed with BGE-M3 — batch job, ~3 hours. All holon content preserved, only vectors change.
  2. Restructure holons to NODE_SCHEMA — map his fields to the standard 14 fields. Content that doesn't have a mechanism or when_not field gets those fields set to empty string. The schema requires the field to exist, not to be filled.
  3. Add errors.lance and dreams.lance to the standard collection list — these are Rob's contribution to v3. They become standard collections in the base model.
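Step 2 of the list above can be sketched as a field mapping. The holon field names are assumptions about Rob's format, and the vector field is left to the re-embedding pass in step 1; the point is that absent fields become empty strings, satisfying the schema without inventing content:

```python
# NODE_SCHEMA fields as described in Part 2 (vector handled separately
# by the re-embedding pass).
REQUIRED = ["id", "text", "title", "node_type", "source_id", "framework_id",
            "confidence_score", "tags", "mechanism", "situation", "when_not",
            "collection", "date_added"]

def holon_to_node(holon):
    # Copy what exists, default the rest: the schema requires the field
    # to exist, not to be filled.
    node = {f: holon.get(f, "") for f in REQUIRED}
    node["confidence_score"] = holon.get("confidence_score", 0.75)
    node["node_type"] = holon.get("node_type", "principle")
    node["collection"] = holon.get("collection", "knowledge")
    return node

node = holon_to_node({"id": "h-001", "text": "A raw holon insight"})
print(node["mechanism"])   # "" (present but empty)
```

After this pass, every migrated holon answers the same queries as a native NODE_SCHEMA record; empty mechanism fields can be filled in later without another migration.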

His security hardening, dream engine, and swarm architecture are all above the schema layer. They don't need to change. The schema is the data format. The applications built on top of it are independent.


Part 9
The Meridian Schema Architecture (Proposed)
TIER 1: NODE_SCHEMA (the protocol — exchangeable)
  Every principle, tactic, concept, example, error, dream_insight.
  14 fields + v3 additions (gravity_score, validation_count, error_count).
  This is what codex packs contain.
  This is what gets emitted to the collective.
  This is the interoperability guarantee.
  Embedding: BGE-M3, 1024-dim, local CPU.

TIER 2: SYSTEM SCHEMAS (internal — never exchanged)
  CONV_SCHEMA     — conversation records (session history)
  EVERGREEN_SCHEMA — synthesis output pages
  SNAPSHOT_SCHEMA  — system vital signs over time
  AGENT_ACTIVITY   — per-agent activity logs (for dreaming)
  These never leave the sovereign node.
  Each client's system tables are their own business.

TIER 3: GRAPH SCHEMA (relationship layer — exchangeable)
  kg_edges        — connections between NODE_SCHEMA nodes
  Edges ARE part of codex packs (they're the knowledge structure).
  edge_id, from_id, to_id, rel_type, weight, notes, created_at

LEGACY_SCHEMA: RETIRE
  os_context, reference_sites → migrate to NODE_SCHEMA
  or mark as system-only (not codex-compatible)

The one rule: If it participates in codex exchange or collective synthesis, it must be NODE_SCHEMA with BGE-M3 1024-dim embeddings. Everything else is internal plumbing that each build can handle however it wants. The schema is the protocol. The protocol is the product.